This paper is devoted to studying the average optimality in
continuous-time Markov decision processes with fairly general state and
action spaces. The criterion to be maximized is expected average
rewards. The transition rates of underlying continuous-time jump
Markov processes are allowed to be unbounded, and the reward rates
may have neither upper nor lower bounds. We first provide two
optimality inequalities with opposed directions, and also give suitable
conditions under which the existence of solutions to the two optimality
inequalities is ensured. Then, from the two optimality inequalities
we prove the existence of optimal (deterministic) stationary policies
by using the Dynkin formula. Moreover, we present a “semimartingale
characterization” of an optimal stationary policy. Finally, we use
a generalized Potlach process with control to illustrate the difference
between our conditions and those in the previous literature, and then
further apply our results to average optimal control problems of
generalized birth-death systems, upwardly skip-free processes and two
queueing systems. The approach developed in this paper is slightly
different from the “optimality inequality approach” widely used in
the previous literature.

1. Introduction. Continuous-time Markov decision processes (MDPs)
have received considerable attention because many optimization models, such
as those in telecommunication and queueing systems, are based on
processes involving continuous time. One of the most common optimality
criteria in continuous-time MDPs is the expected average criterion, which has
been studied by many authors. In this paper we are also concerned with this
expected average criterion. As is well known, continuous-time MDPs can be
specified by four primitive data: a state space $S$; an action space $A$ with
subsets $A(x)$ of admissible actions, which may depend on the current state
$x \in S$; transition rates $q(\cdot|x,a)$; and reward (or cost) rates $r(x,a)$. Using
these terms, we now briefly describe some existing works on the expected
average criterion. When the state space is finite, a bounded solution to the
average optimality equation (AOE) and methods for computing optimal
stationary policies have been investigated in [23, 26, 30]. Since then, most work
has focused on the case of a denumerable state space; for instance, see [6, 24]
for bounded transition and reward rates, [18, 27, 31, 34, 39, 41] for bounded
transition rates but unbounded reward rates, [16, 35] for unbounded
transition rates but bounded reward rates and [12, 13, 17] for unbounded transition
and reward rates. For the case of an arbitrary state space, to the best of our
knowledge, only Doshi [5] and Hernandez-Lerma [19] have addressed this
issue. They ensured the existence of optimal stationary policies. However, the
treatments in [5] and [19] are restricted to uniformly bounded reward rates 
and nonnegative cost rates, respectively, and the AOE plays a key role in 
the proof of the existence of average optimal policies. Moreover, to establish 
the AOE, Doshi [5] needed the hypothesis that all admissible action sets are 
finite and the relative difference of the optimal discounted value function 
is equicontinuous, whereas in [19] the assumption about the existence of a 
solution to the AOE is imposed. On the other hand, it is worth mentioning 
that some of the conditions in [5, 19] are imposed on the family of weak 
infinitesimal operators deduced from all admissible policies, instead of the 
primitive data. In this paper we study a much more general case. That is,
the reward rates may have neither upper nor lower bounds, all of the state 
and action spaces are fairly general and the transition rates are allowed to 
be unbounded. We first provide two optimality inequalities rather than one 
for the “optimality inequality approach” used in [16, 19], for instance. Under 
suitable assumptions we not only prove the existence of solutions to the two 
optimality inequalities, but also ensure the existence of optimal stationary 
policies by using the two inequalities and the Dynkin formula. Also, to verify 
our assumptions, we further give sufficient conditions which are imposed on 
the primitive data. Moreover, we present a semimartingale characterization 
of an optimal stationary policy. Finally, we use controlled generalized
Potlach processes [4, 22] to show that all conditions in this paper are satisfied,
whereas the earlier conditions fail to hold. Then we further apply our results
to average optimal control problems of generalized birth-death systems and
upwardly skip-free processes [1], a pair of controlled queues in tandem [28],
and M/M/N/0 queue systems [25, 40]. It should be noted that, on the one
hand, the optimality inequality approach used in the previous literature (see,
e.g., [16, 19] for continuous-time MDPs and [20, 21, 31, 34] for discrete-time
MDPs) cannot be applied to our case, because in our model the reward rates
may have neither upper nor lower bounds. On the other hand, we not only
replace the AOE with two optimality inequalities, but also relax the
condition of the equicontinuity of the relative difference of optimal discounted
value functions [5]. Therefore, the approach developed in this paper can be
regarded as a modification of the optimality inequality approach widely used
in the previous literature.

The rest of this paper is organized as follows. In Section 2 we introduce
the optimal control problem. Our main results are given in Section 4 after
some technical preliminaries in Section 3. We illustrate our conditions and
results with examples in Section 5, and conclude in Section 6 with some
general remarks.

2. The optimal control problem. 

Notation. If $X$ is a Polish space (i.e., a complete and separable metric
space), we denote by $\mathcal{B}(X)$ its Borel $\sigma$-algebra.

The model of continuous-time MDPs with which we are concerned is of 
the form 

(2.1)    $\{S,\ (A(x) \subset A),\ q(\cdot|x,a),\ r(x,a)\}$

with the following components. 

• The variable $S$ is the state space, a Polish space.

• The term $A(x)$ is a Borel subset of $A$ which denotes the set of admissible
actions at state $x \in S$, where $A$ is the action space, also a Polish space. The
set

(2.2)    $K := \{(x,a) \mid x \in S,\ a \in A(x)\}$

of pairs of states and actions is assumed to be a Borel subset of $S \times A$.

• The element $q(\cdot|x,a)$ in (2.1) denotes the transition rates, which satisfy
the following properties for each $(x,a) \in K$ and $D \in \mathcal{B}(S)$:

P1: The element $q(\cdot|x,a)$ is a signed measure on $\mathcal{B}(S)$, and $q(D|\cdot,\cdot)$ is Borel
measurable on $K$.

P2: For all $x \notin D \in \mathcal{B}(S)$, $0 \le q(D|x,a) < \infty$.

P3: $q(S|x,a) = 0$ and $0 \le -q(\{x\}|x,a) < \infty$.

Furthermore, the model is assumed to be stable, that is,

(2.3)    $q(x) := \sup_{a \in A(x)} (-q(\{x\}|x,a)) < \infty \quad \forall x \in S.$

• The real-valued function $r(x,a)$ denotes the reward rates and it is
assumed to be measurable on $K$. [Since $r(x,a)$ is allowed to take positive
and negative values, it can be interpreted as a cost rate rather than a
“reward” rate.]
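To make properties P2, P3 and the stability condition (2.3) concrete, the following minimal sketch checks their finite-state analogues. The two-state, two-action model, the action names and all numbers are hypothetical and serve only as an illustration; with a finite state space the signed measure $q(\cdot|x,a)$ reduces to a row of rates.

```python
# Hypothetical two-state, two-action model; all numbers are illustrative.
# q[x][a][y] is the transition rate from state x to state y under action a.
q = {
    0: {"slow": [-1.0, 1.0], "fast": [-3.0, 3.0]},
    1: {"slow": [2.0, -2.0], "fast": [5.0, -5.0]},
}

def check_rates(q, tol=1e-12):
    """Check the finite-state analogues of P2 and P3 for every pair (x, a)."""
    for x, actions in q.items():
        for row in actions.values():
            # P2: rates towards states other than x are nonnegative.
            if any(rate < 0 for y, rate in enumerate(row) if y != x):
                return False
            # P3: q(S|x,a) = 0 (conservative row) and -q({x}|x,a) >= 0.
            if abs(sum(row)) > tol or row[x] > 0:
                return False
    return True

def max_exit_rate(q, x):
    """q(x) := sup over a in A(x) of -q({x}|x,a), as in (2.3)."""
    return max(-row[x] for row in q[x].values())
```

With finitely many actions the supremum in (2.3) is a maximum, so stability holds automatically here; the condition only becomes restrictive for infinite action sets.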




X. GUO AND U. RIEDER 


We now define a randomized Markov policy. 


Definition 2.1 (Randomized Markov policies). Let $\Phi$ be the set of
functions $\pi_t(B|x)$ on $[0,\infty) \times \mathcal{B}(A) \times S$ such that:

(1) For each $t \ge 0$, $\pi_t(\cdot|x)$ is a stochastic kernel on $A$ given $S$ such that
$\pi_t(A(x)|x) = 1$ for all $x \in S$.

(2) For each $B \in \mathcal{B}(A)$ and $x \in S$, $\pi_t(B|x)$ is a Borel measurable function
in $t \ge 0$.

A function $\pi_t(B|x)$ in $\Phi$ is called a randomized Markov policy. We will
write $\pi_t(B|x)$ simply as $(\pi_t)$. The subscript “$t$” in $\pi_t$ indicates the possible
dependence on time. A randomized Markov policy $\pi := (\pi_t) \in \Phi$ is called
(deterministic) stationary if there exists a Borel measurable function $f$ on
$S$ such that

$f(x) \in A(x)$ and $\pi_t(\{f(x)\}|x) = 1 \quad \forall t \ge 0$ and $x \in S$.

For simplicity, we denote by $f$ this stationary policy $\pi$. The set of all
stationary policies is denoted by $F$; this means that $F$ is the set of all measurable
functions $f$ on $S$ with $f(x) \in A(x)$ for all $x \in S$. Obviously, $F \subset \Phi$.


By (1) above, w.l.o.g. we also regard $\pi_t(\cdot|x)$ as a probability measure on
$A(x)$. Thus, for each fixed policy $\pi = (\pi_t) \in \Phi$, the associated transition rates
$q(D|x,\pi_t)$ can be defined by

(2.4)    $q(D|x,\pi_t) := \int_{A(x)} q(D|x,a)\,\pi_t(da|x)$

for each $x \in S$, $D \in \mathcal{B}(S)$ and $t \ge 0$.
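For a finite action set the integral in (2.4) is a weighted sum of rate rows. The sketch below illustrates this averaging; the helper name `mixed_rates`, the action names and the numbers are assumptions made for the example and do not come from the paper.

```python
def mixed_rates(q_x, pi_x):
    """Row q(.|x, pi_t) = sum over a of q(.|x,a) * pi_t(a|x), cf. (2.4)."""
    n = len(next(iter(q_x.values())))
    return [sum(pi_x[a] * q_x[a][y] for a in q_x) for y in range(n)]

# Hypothetical rates out of state x = 0 and a randomized choice of action.
q_x = {"slow": [-1.0, 1.0], "fast": [-3.0, 3.0]}
pi_x = {"slow": 0.25, "fast": 0.75}
row = mixed_rates(q_x, pi_x)
```

Because each row $q(\cdot|x,a)$ sums to zero and the weights sum to one, the mixed row is again conservative, which is the finite-state content of Lemma 2.1(a3).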


In particular, when $\pi$ is stationary (i.e., $\pi =: f \in F$), we write the left-hand
side of (2.4) as $q(D|x,f(x))$. Then $q(D|x,\pi_t)$ is called an infinitesimal
generator [for any fixed policy $\pi = (\pi_t) \in \Phi$]; see [5, 24], for instance. Its equivalent
form can be found in [8]. As is well known, any (possibly substochastic and
nonhomogeneous) transition function $p^\pi(s,x,t,D)$ that depends on $\pi$ such
that

$\lim_{\varepsilon \to 0^+} \frac{p^\pi(t,x,t+\varepsilon,D) - I_D(x)}{\varepsilon} = q(D|x,\pi_t)$

for all $x \in S$, $D \in \mathcal{B}(S)$ and $t \ge 0$ is called a Q-process with transition rates
$q(D|x,\pi_t)$, where $I_D(x)$ is the indicator function of the set $D$.

To guarantee the existence of such a Q-process, we need to introduce the 
class of admissible policies. 


Definition 2.2 (An admissible policy). A policy $(\pi_t)$ in $\Phi$ is said to be
admissible if for each $x \in S$ the functions $\int_{A(x)} h(a)\,\pi_t(da|x)$ are continuous
in $t \ge 0$ for all bounded measurable functions $h$ on $A(x)$. We denote by $\Pi$
the class of all admissible policies. Observe that $\Pi$ is nonempty because it
contains $F$. Moreover, it is easy to provide an example for which $\Pi$ can be
chosen to be strictly larger than $F$.

By P1-P3, (2.3), (2.4) and Definition 2.2, we have the following facts.

Lemma 2.1. For each $\pi := (\pi_t) \in \Pi$, the following statements hold.

(a) For each $x \in S$, $t \ge 0$ and $D \in \mathcal{B}(S)$:

(a1) $q(D|x,\pi_t)$ is a signed measure in $D \in \mathcal{B}(S)$;

(a2) $0 \le q(D|x,\pi_t) < \infty$ when $x \notin D$;

(a3) $q(S|x,\pi_t) = 0$, $0 \le -q(\{x\}|x,\pi_t) < \infty$;

(a4) $q(D|x,\pi_t)$ is continuous in $t \ge 0$ and measurable in $x \in S$.

(b) There exists a Q-process $p^\pi(s,x,t,D)$ with transition rates $q(D|x,\pi_t)$.

Proof. Parts (a1)-(a3) follow from (2.4) and the definition of model
(2.1), while part (a4) follows from (2.3) and Definition 2.2. By (a) and
Theorem 2 in [8], we see that (b) is also true. □

Lemma 2.1(b) guarantees the existence of a Q-process such as the
minimum Q-process $p^\pi_{\min}(s,x,t,D)$ [i.e., $p^\pi_{\min}(s,x,t,D) \le p^\pi(s,x,t,D)$ for any
Q-process $p^\pi(s,x,t,D)$], which can be directly constructed from the
transition rates $q(D|x,\pi_t)$; see [8, 11], for instance. However, as is known [4, 8], such
a Q-process might not be regular; that is, we might have $p^\pi_{\min}(s,x,t,S) < 1$
for some $x \in S$ and $t \ge s \ge 0$.

To ensure the regularity of a Q-process, we use the “drift” conditions
below.

Assumption A. There exist a measurable function $w \ge 1$ on $S$, and
constants $c > 0$, $b \ge 0$ and $M_q > 0$, such that:

(1) For all $(x,a) \in K$, $\int_S w(y)\,q(dy|x,a) \le -c\,w(x) + b$.

(2) For all $x \in S$, with $q(x)$ as in (2.3), $q(x) \le M_q w(x)$.
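As an illustration of how Assumption A can hold with unbounded transition rates, consider an uncontrolled birth-death chain on the nonnegative integers with constant birth rate and a death rate proportional to the state. The chain, the weight $w(x) = x + 1$ and the constants below are assumptions chosen for this sketch; they are not one of the paper's examples. The code checks both parts of Assumption A on the first 1000 states.

```python
lam, mu = 2.0, 0.5  # hypothetical birth rate and per-unit death rate

def w(x):
    """Candidate weight function w(x) = x + 1 >= 1."""
    return x + 1.0

def drift(x):
    """sum_y w(y) q(y|x) for rates q(x+1|x) = lam, q(x-1|x) = mu * x."""
    birth, death = lam, mu * x
    return birth * w(x + 1) + death * w(x - 1) - (birth + death) * w(x)

# Candidate constants for Assumption A(1) and A(2).
c, b, M_q = mu, lam + mu, max(lam, mu)

def assumption_a_holds(n_states=1000, tol=1e-9):
    """Check the drift inequality and the exit-rate bound on states 0..n-1."""
    for x in range(n_states):
        if drift(x) > -c * w(x) + b + tol:
            return False
        if lam + mu * x > M_q * w(x) + tol:
            return False
    return True
```

Algebraically the drift equals `lam - mu * x`, which coincides with `-c * w(x) + b` for the constants above, so the finite check only confirms an identity that holds for every state.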

Remark 2.1. (a) For the case of uniformly bounded transition rates
[i.e., $\sup_{x \in S} q(x) < \infty$], Assumption A(2) is not required because it is only
used to guarantee the regularity of a Q-process.

(b) Assumption A(1) is used not only for the regularity of a (possibly
nonhomogeneous) Q-process, but also for the finiteness of the expected
average criterion (2.6) below. Moreover, Assumption A(1) is a variant of the
“drift condition” (2.4) in [28] for homogeneous Q-processes.

Under Assumption A, by Theorem 3.2 in [11] we see that a Q-process with
transition rates $q(D|x,\pi_t)$ is regular, that is, $p^\pi_{\min}(s,x,t,S) = 1$ for all $x \in S$
and $t \ge s \ge 0$. Thus, under Assumption A we write the regular Q-process
$p^\pi_{\min}(s,x,t,D)$ simply as $p^\pi(s,x,t,D)$.

We now state the optimality problem with which we are concerned. 

For a given (initial) distribution $\mu_s$ on $S$ at $s \ge 0$ and each fixed
policy $\pi = (\pi_t) \in \Pi$, let $p^\pi(s,x,t,D)$ be the regular Q-process. Then, as in
the proof in [15], we can show the existence of a unique probability space
$(\Omega, \mathcal{B}(\Omega), \tilde{P}^\pi_{s,\mu_s})$ with $\Omega = (S \times A)^{[0,\infty)}$. Let $\xi(t)$ and $\eta(t)$ denote the state
and action processes, respectively (i.e., the coordinate processes defined on
$\Omega$), and let $\tilde{E}^\pi_{s,\mu_s}$ denote the expectation operator associated with $\tilde{P}^\pi_{s,\mu_s}$. We
write $\tilde{P}^\pi_{s,x}$ for $\tilde{P}^\pi_{s,\mu_s}$ and $\tilde{E}^\pi_{s,x}$ for $\tilde{E}^\pi_{s,\mu_s}$ when $\mu_s$ is the Dirac measure at $x \in S$.
Moreover, let

(2.5)    $r(x,\pi_t) := \int_{A(x)} r(x,a)\,\pi_t(da|x)$ for all $x \in S$ and $t \ge 0$.

We will write $r(x,\pi_t)$ as $r(x,f(x))$ when $\pi = (\pi_t) =: f \in F$. Then we have
the following lemma.


Lemma 2.2. Suppose that Assumption A holds. Then, for each $x \in S$
and $\pi = (\pi_t) \in \Pi$:

(a) For all $t \ge s \ge 0$, $\tilde{E}^\pi_{s,x} r(\xi(t),\eta(t)) = \int_S r(y,\pi_t)\,p^\pi(s,x,t,dy)$.

(b) The element $\tilde{E}^\pi_{s,x} r(\xi(t),\eta(t))$ is Borel measurable in $t$ ($t \ge s \ge 0$).


Proof. Part (a) follows from a similar proof in [15], while part (b)
follows from (a) and (2.5) because $p^\pi(s,x,t,D)$ is continuous in $t \ge s \ge 0$;
see [8]. □


For each $x \in S$ and $\pi \in \Pi$, the expected average criterion $V(x,\pi)$ is defined
as

(2.6)    $V(x,\pi) := \liminf_{T \to \infty} \frac{\int_0^T \tilde{E}^\pi_{0,x}\, r(\xi(t),\eta(t))\,dt}{T}.$


Definition 2.3. A policy $\pi^*$ in $\Pi$ is said to be (average) optimal if
$V(x,\pi^*) \ge V(x,\pi)$ for all $\pi \in \Pi$ and $x \in S$.
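The long-run time average behind the expected average criterion can be evaluated in closed form for a two-state chain under a single stationary policy. The sketch below assumes a hypothetical chain with rate `a` from state 0 to state 1, rate `b` back, and reward rates `r`; it compares the finite-horizon average started in state 0 with the stationary average, which is its limit for this ergodic chain.

```python
import math

a, b = 1.0, 2.0      # hypothetical rates 0 -> 1 and 1 -> 0
r = [4.0, -2.0]      # hypothetical reward rates, one per state

def time_average(T):
    """(1/T) * integral over [0, T] of E[r(x(t))] for x(0) = 0."""
    s = a + b
    # P(x(t) = 1 | x(0) = 0) = (a/s) * (1 - exp(-s*t)); integrate over [0, T].
    occupation_1 = (a / s) * (T - (1.0 - math.exp(-s * T)) / s)
    return (r[0] * (T - occupation_1) + r[1] * occupation_1) / T

# Stationary distribution of this chain is (b, a) / (a + b).
stationary_average = (b * r[0] + a * r[1]) / (a + b)
```

For small horizons the average stays close to the reward rate of the initial state, and the dependence on the initial state disappears as the horizon grows.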


The main goal of this paper is to give conditions for the existence of an 
optimal stationary policy. 

For each $x \in S$, $s \ge 0$ and $\pi := (\pi_t) \in \Pi$, we denote by $E^\pi_{s,x}$ the expectation
operator associated with the probability measure $P^\pi_{s,x}$, which is completely
determined by $p^\pi(s,x,t,D)$. Then by pages 107-109 in [10] (or by Theorem
14.4, page 121 in [38] and the homogenization technique in [5]) there exists
a Borel measurable Markov process $x(t)$ ($t \ge 0$) with values in $S$. Obviously,
the so-called state process $x(t)$ is a continuous-time jump Markov process;
its transition function is $p^\pi(s,x,t,D)$ determined by the transition rates
$q(D|x,\pi_t)$.

Then by Lemma 2.2, (2.6) and (2.5) we have

(2.7)    $V(x,\pi) = \liminf_{T \to \infty} \frac{\int_0^T [E^\pi_{0,x}\, r(x(t),\pi_t)]\,dt}{T}.$

Here, we understand that $x(t)$ is any sample path and these sample paths are
distributed according to $P^\pi_{s,x}$. Hence, these sample paths depend
on $\pi$, $s$ and $x$. However, such dependence will be dropped for simplicity when
there is no confusion.

By (2.7), w.l.o.g. we will limit ourselves to using $x(t)$ and the corresponding
$P^\pi_{s,x}$ and $E^\pi_{s,x}$ throughout the following discussion. In particular, let
$P^\pi_x := P^\pi_{0,x}$ and $E^\pi_x := E^\pi_{0,x}$.

3. Preliminaries. In this section, we give some preliminary lemmas that 
are needed to prove our main results. 


Lemma 3.1. Suppose that Assumption A holds. Then, for each $\pi \in \Pi$,

$E^\pi_x w(x(t)) \le e^{-ct} w(x) + \frac{b}{c} \quad \forall x \in S$ and $t \ge 0$

with $w(x)$, $c$ and $b$ as in Assumption A.

For the proof, see Theorem 3.1 in [11].

To prove our main results, in addition to the previous result, we also need
some facts on the $\alpha$-discounted criterion defined by (3.1) below. For each
discount factor $\alpha > 0$, $x \in S$ and $\pi \in \Pi$, the $\alpha$-discounted criterion $J_\alpha(x,\pi)$
and the corresponding optimal discounted value function $J^*_\alpha(x)$ are defined
by

(3.1)    $J_\alpha(x,\pi) := \int_0^\infty e^{-\alpha t} [E^\pi_x r(x(t),\pi_t)]\,dt$ and $J^*_\alpha(x) := \sup_{\pi \in \Pi} J_\alpha(x,\pi)$,

respectively.

A policy $\pi^*$ in $\Pi$ is said to be $\alpha$-discounted optimal if $J_\alpha(x,\pi^*) = J^*_\alpha(x)$
for all $x \in S$.

To ensure the finiteness of both $V(x,\pi)$ and $J_\alpha(x,\pi)$, and the existence of
$\alpha$-discounted optimal stationary policies, we give the following conditions.

Assumption B. (1) For each $x \in S$, $A(x)$ is compact.

(2) For each fixed $x \in S$, $r(x,a)$ is continuous in $a \in A(x)$, and the
functions $\int_S u(y)\,q(dy|x,a)$ are continuous in $a \in A(x)$ for all bounded measurable
functions $u$ on $S$ and also for $u := w$ as in Assumption A.

(3) For all $a \in A(x)$ and $x \in S$, $|r(x,a)| \le M w(x)$ with some constant
$M > 0$.

(4) There exist a nonnegative measurable function $w'$ on $S$, and constants
$c' > 0$, $b' \ge 0$ and $M' > 0$ such that [with $q(x)$ as in (2.3)]

$q(x) w(x) \le M' w'(x)$ and $\int_S w'(y)\,q(dy|x,a) \le c' w'(x) + b' \quad \forall (x,a) \in K.$

Remark 3.1. Assumptions B(1) and B(2) are similar to the standard
continuity-compactness hypotheses for discrete-time MDPs; see, for instance,
[21, 31] and references therein. Under Assumptions A and B(3), by Lemma
3.1 we see that the values $V(x,\pi)$ and $J_\alpha(x,\pi)$ are both finite. Assumption
B(4) allows us to use the Dynkin formula. On the other hand, if $q(x)$ or
$r(x,a)$ is bounded, then Assumption B(4) is not required.

Lemma 3.2. Under Assumptions A and B, the following statements
hold, with $\alpha > 0$.

(a) For all $x \in S$ and $\pi \in \Pi$, $|J_\alpha(x,\pi)| \le \frac{M}{\alpha + c} w(x) + \frac{bM}{\alpha c}$.

(b) The optimal discounted value function $J^*_\alpha(x)$ satisfies the optimality
equation

(3.2)    $\alpha J^*_\alpha(x) = \sup_{a \in A(x)} \left\{ r(x,a) + \int_S J^*_\alpha(y)\,q(dy|x,a) \right\} \quad \forall x \in S.$

(c) There exists an $\alpha$-discounted optimal stationary policy $f^*_\alpha \in F$.

For the proof, see Theorem 3.3 in [11].
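For a finite model, the discounted optimality equation (3.2) can be solved numerically by a fixed-point iteration. The sketch below is not the method of [11]; it rearranges (3.2) with a constant `m` at least as large as every exit rate (a standard uniformization step) so that the update is a contraction with factor `m / (alpha + m)`. The model data, the function names and the choice of `m` are assumptions made for illustration.

```python
def discounted_values(q, r, alpha, m, sweeps=2000):
    """Iterate J(x) <- max_a [r(x,a) + sum_y q(y|x,a) J(y) + m J(x)] / (alpha + m).

    A fixed point satisfies alpha * J(x) = max_a [r(x,a) + sum_y J(y) q(y|x,a)].
    """
    n = len(q)
    J = [0.0] * n
    for _ in range(sweeps):
        J = [
            max(
                (r[x][a] + sum(q[x][a][y] * J[y] for y in range(n)) + m * J[x])
                / (alpha + m)
                for a in range(len(q[x]))
            )
            for x in range(n)
        ]
    return J

def residual(q, r, alpha, J):
    """Largest violation of the discounted optimality equation at J."""
    n = len(q)
    return max(
        abs(
            alpha * J[x]
            - max(
                r[x][a] + sum(q[x][a][y] * J[y] for y in range(n))
                for a in range(len(q[x]))
            )
        )
        for x in range(n)
    )

# Hypothetical model: q[x][a] is a rate row, r[x][a] a reward rate.
q = [[[-1.0, 1.0], [-3.0, 3.0]], [[2.0, -2.0], [5.0, -5.0]]]
r = [[1.0, 0.0], [3.0, 2.5]]
J = discounted_values(q, r, alpha=0.1, m=6.0)
```

The iteration needs `m` to dominate the largest exit rate (5 in this model) so that the implied transition weights are nonnegative; for unbounded rates, as allowed in the paper, no such constant exists and this simple scheme does not apply directly.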

To state our final conditions, we need to introduce the concept of the
weighted norm used in [20, 21]. For any fixed measurable function $h \ge 1$
on $S$, a function $u$ on $S$ is called $h$-bounded if the weighted norm of $u$,
$\|u\|_h := \sup_{x \in S} [|u(x)|/h(x)]$, is finite. Such a function $h$ will be referred to as a
weight function. We denote by $B_h(S)$ the Banach space of all $h$-bounded
measurable functions $u$ on $S$.

Assumption C. There exist two functions $v_1, v_2 \in B_w(S)$ (with $w$ as in
Assumption A) and some state $x_0 \in S$ such that

$v_1(x) \le h_\alpha(x) \le v_2(x) \quad \forall x \in S$ and $\alpha > 0$,

where $h_\alpha(x) := J^*_\alpha(x) - J^*_\alpha(x_0)$ is the so-called relative difference of the
optimal discounted value function $J^*_\alpha(x)$.

Remark 3.2. (a) Assumption C is a variant of the conditions for
discrete-time MDPs; see (SEN2) on page 132 in [34] and Assumption 5.4.1(b) in [20],
for instance.

(b) It should be noted that the function $v_1$ in our Assumption C may
not be bounded below, and so the $h_\alpha(x)$ may not be bounded below either.
However, the corresponding $h_\alpha(x)$ in [20, 34] is assumed to be bounded below.
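Assumption C asks that the relative differences stay bounded (in the weighted sense) uniformly in the discount factor. For a two-state chain with a single policy this can be seen explicitly by solving the linear system $\alpha J = r + QJ$. The chain, its rates and the helper names below are hypothetical and only illustrate the behaviour as the discount factor tends to zero.

```python
a, b = 1.0, 2.0        # hypothetical rates 0 -> 1 and 1 -> 0
r0, r1 = 4.0, -2.0     # hypothetical reward rates in states 0 and 1

def discounted_pair(alpha):
    """Solve alpha*J = r + Q*J for Q = [[-a, a], [b, -b]] by Cramer's rule."""
    det = alpha * (alpha + a + b)
    J0 = ((alpha + b) * r0 + a * r1) / det
    J1 = (b * r0 + (alpha + a) * r1) / det
    return J0, J1

def scaled_value(alpha):
    """alpha * J_alpha(x0) with reference state x0 = 0."""
    return alpha * discounted_pair(alpha)[0]

def relative_difference(alpha):
    """h_alpha(1) = J_alpha(1) - J_alpha(0)."""
    J0, J1 = discounted_pair(alpha)
    return J1 - J0

average_reward = (b * r0 + a * r1) / (a + b)
```

Here `relative_difference(alpha)` equals `(r1 - r0) / (alpha + a + b)`, so it is bounded by `abs(r1 - r0) / (a + b)` for every positive discount factor, while `scaled_value(alpha)` approaches the average reward.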

To verify Assumption C, we now provide some sufficient conditions. 


Lemma 3.3. Under Assumptions A and B, each one of the following
conditions (a) and (b) implies Assumption C.

(a) For each $f \in F$ there exists a probability measure $\mu_f$ on $\mathcal{B}(S)$ such
that

(3.3)    $\left| E^f_x[u(x(t))] - \int_S u(y)\,\mu_f(dy) \right| \le R e^{-\rho t} w(x) \quad \forall |u| \le w$ and $t \ge 0$,

where $R > 0$ and $\rho > 0$ are constants independent of $f$.

(b) For some integer $d \ge 1$, $S := [0,\infty)^d$ and $q(x)$ is locally bounded on
$S$. Moreover, the following conditions are satisfied:

(b1) Drift condition. The function $w$ in Assumption A is nondecreasing
in each component and, moreover,

$\int_S w(y)\,q(dy|x,a) \le -c\,w(x) + b\,I_{\{0_d\}}(x) \quad \forall (x,a) \in K,$

where $0_d := (0,0,\dots,0) \in S$.

(b2) Monotonicity condition. For each $x_k \in S$, $a_k \in A(x_k)$ ($k = 1,2$)
and monotone set $D$ [i.e., $I_D(x)$ is increasing in $x \in S$], if $x_1 \le x_2$ and
$x_2 \notin D$, then $\bar{q}(D|x_1,a_1) \le \bar{q}(D|x_2,a_2)$, and $\bar{q}(D^c|x_1,a_1) \ge \bar{q}(D^c|x_2,a_2)$ when
$x_1 \le x_2$ and $x_1 \in D$, where $\bar{q}(D|x_k,a_k) := q(D|x_k,a_k) - q(\{x_k\}|x_k,a_k) I_D(x_k)$
and $D^c := S - D$ is the complement of the set $D$.



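Proof. (a) Since $w(x) \ge 1$ and $|r(x,a)| \le M w(x)$ for all $x \in S$ and
$a \in A(x)$, by Lemma 3.2(c) and (3.3) we have, with $f^*_\alpha$ an $\alpha$-discounted
optimal stationary policy,

$|h_\alpha(x)| = \left| \int_0^\infty e^{-\alpha t} [E^{f^*_\alpha}_x r(x(t),f^*_\alpha(x(t))) - E^{f^*_\alpha}_{x_0} r(x(t),f^*_\alpha(x(t)))]\,dt \right|$

$\le MR \int_0^\infty e^{-(\alpha+\rho)t} [w(x) + w(x_0)]\,dt$

$\le \frac{MR}{\rho} [1 + w(x_0)]\,w(x) =: v_2(x),$

which verifies Assumption C with $v_1(x) := -v_2(x)$.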

(b) By Theorem 5.47 in [4], we see that for each $f \in F$ the corresponding
Markov process $x(t)$ is stochastically ordered. Moreover, for each $x \in S$, $f \in F$
and $|u| \le w$, from the proof of (7.1) in [28] and condition (b), we have

$\left| E^f_x[u(x(t))] - \int_S u(x)\,\mu_f(dx) \right| \le 2e^{-ct}\left[ w(x) + \frac{b}{c} \right] \le 2\left[ 1 + \frac{b}{c} \right] e^{-ct} w(x),$

which gives condition (a) and so Assumption C follows. □

Obviously, Lemma 3.3 is also true when $S = [0,\infty)^d$ and $I_{\{0_d\}}(x)$ in
condition (b1) are replaced with $S = [\beta_1,\infty) \times \cdots \times [\beta_d,\infty)$ and $I_{\{\beta_d\}}(x)$,
respectively, where $\beta_d := (\beta_1,\dots,\beta_d) \in S$, $\beta_i \ge 0$ ($i = 1,\dots,d$).

The validity of conditions (a) and (b) in Lemma 3.3 can also be obtained
in several ways. For instance, [28] uses Assumption A and monotonicity
conditions. Other approaches that yield exponential ergodicity (3.3) can be
seen in [3, 7, 36, 40], for instance.

To prove our main results by using the Dynkin formula, we need the 
following facts from Lemma 5.2 in [11]. 


Lemma 3.4. Suppose that Assumptions A and B hold. Take arbitrarily
$\pi := (\pi_t) \in \Pi$ and $x \in S$.

(a) For each $u \in B_{w+w'}(S)$ (with $w$ and $w'$ as in Assumptions A and B):

(ai) ||£^J|u(x(t))|[|^+^/ < vt>0; 

(a 2 ) limt\se-^= E((yu{x{t)) = limt\s E((u{x{t)) = u{y) for 

all y € S and s > 0. 


(b) For each $u \in B_w(S)$ and $t \ge s \ge 0$:

(b1) $L^\pi u(s,x) := \lim_{t \downarrow 0} t^{-1}[E^\pi_{s,x} u(x(s+t)) - u(x)] = \int_S u(y)\,q(dy|x,\pi_s)$;

(b2) 


\EfJL^u{t,x{t))\\\^+^. < ll»IUF+c'+y'+2M0^ ^(c+c0d-s)_ 


Lemma 3.4 shows that $L^\pi$ in Lemma 3.4(b) is the extended generator of
the Q-process $p^\pi(s,x,t,D)$ and that the domain of $L^\pi$ contains $B_w(S)$.

Finally, for ease of reference, we state a “measurable selection theorem” 
from [20, 33]. 


Lemma 3.5 (A measurable selection theorem). Let $C(A)$ be the
collection of all nonempty compact subsets of $A$, and let $D$ be a multifunction
from $S$ to $C(A)$ such that $K := \{(x,a) \mid x \in S,\ a \in D(x)\}$ is a Borel subset of
$S \times A$. If $v(x,a)$ is a real-valued measurable function on $K$ such that $v(x,a)$
is continuous in $a \in D(x)$ for each $x \in S$, then there exists a measurable
function $f : S \to A$ such that $f(x) \in D(x)$ for all $x \in S$ and

$v(x,f(x)) = \max_{a \in D(x)} v(x,a)$ for each $x \in S$.

Moreover, the function $v^*(x) := \max_{a \in D(x)} v(x,a)$ is measurable in $x \in S$.

Lemma 3.5 will be used to prove the existence of an optimal stationary
policy.

4. The main results and proof. In this section, we prove our main results.


Theorem 4.1. Suppose that Assumptions A, B and C hold, and $\pi = (\pi_t)$
is in $\Pi$.

(a) If there exist a constant $g$ and a function $u \in B_w(S)$ such that

$g \ge r(x,\pi_t) + \int_S u(y)\,q(dy|x,\pi_t) \quad \forall x \in S$ and $t \ge 0$,

then $g \ge V(x,\pi)$ for all $x \in S$.

(b) Similarly, if there exist a constant $g$ and a function $u \in B_w(S)$ such
that

$g \le r(x,\pi_t) + \int_S u(y)\,q(dy|x,\pi_t) \quad \forall x \in S$ and $t \ge 0$,

then $g \le V(x,\pi)$ for all $x \in S$.

Proof. (a) For each $x \in S$ and $T \ge 0$, under condition (a), by Lemma
3.4 and the Dynkin formula (page 141 or 146 in [9]), we have

(4.1)    $E^\pi_x u(x(T)) - u(x) = E^\pi_x \left[ \int_0^T L^\pi u(t,x(t))\,dt \right]$

$\le Tg - E^\pi_x \left[ \int_0^T r(x(t),\pi_t)\,dt \right]$

$= Tg - \int_0^T E^\pi_x r(x(t),\pi_t)\,dt.$

On the other hand, by Lemma 3.1 we have

$E^\pi_x |u(x(T))| \le \|u\|_w \left[ e^{-cT} w(x) + \frac{b}{c} \right],$

which together with (4.1) and (2.7) gives (a).

(b) Similarly, we can prove (b). □
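When the two inequalities of Theorem 4.1 hold simultaneously, that is, with equality, the constant $g$ must coincide with the expected average reward. The sketch below illustrates this on a hypothetical two-state chain under a single stationary policy: it solves $g = r(x) + \sum_y u(y) q(y|x)$ by hand and checks that the resulting $g$ is the stationary average reward. The rates, rewards and function names are assumptions made for the example.

```python
a, b = 1.0, 2.0          # hypothetical rates 0 -> 1 and 1 -> 0
r = [4.0, -2.0]          # hypothetical reward rates
Q = [[-a, a], [b, -b]]   # conservative rate matrix of the single policy

def poisson_pair():
    """Solve g = r(x) + sum_y u(y) q(y|x) for x = 0, 1, normalised by u(0) = 0."""
    u1 = (r[1] - r[0]) / (a + b)
    g = r[0] + a * u1
    return g, [0.0, u1]

def gap(g, u, x):
    """g - [r(x) + sum_y u(y) q(y|x)]; zero means both inequalities hold at x."""
    return g - (r[x] + sum(u[y] * Q[x][y] for y in range(2)))

g, u = poisson_pair()
stationary_average = (b * r[0] + a * r[1]) / (a + b)
```

On a finite state space every function is bounded, so the weighted-norm requirement on $u$ is automatic; the interest of the theorem lies in the unbounded case, which this toy example does not capture.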

Theorem 4.2. Suppose that Assumptions A, B and C hold.

(a) There exist a constant $g^*$, functions $u^*_1, u^*_2 \in B_w(S)$ and a stationary
policy $f^* \in F$ that satisfy the two optimality inequalities

(4.2)    $g^* \ge \sup_{a \in A(x)} \left\{ r(x,a) + \int_S u^*_1(y)\,q(dy|x,a) \right\} \quad \forall x \in S;$

(4.3)    $g^* \le \sup_{a \in A(x)} \left\{ r(x,a) + \int_S u^*_2(y)\,q(dy|x,a) \right\} \quad \forall x \in S,$

(4.4)    $\phantom{g^*} = r(x,f^*(x)) + \int_S u^*_2(y)\,q(dy|x,f^*(x)) \quad \forall x \in S.$

(b) For all $x \in S$, $g^* = \sup_{\pi \in \Pi} V(x,\pi) = V(x,f^*)$.

(c) Any stationary policy $f \in F$ that realizes the maximum of (4.3) is
optimal, and so $f^*$ in (4.4) is an optimal stationary policy.

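Proof. (a) Let $x_0 \in S$ be as in Assumption C and let $\{\alpha_n\}$ be any
sequence of discount factors such that $\alpha_n \to 0$ as $n \to \infty$. By Lemma 3.2(a),
$|\alpha_n J^*_{\alpha_n}(x_0)|$ is bounded in $n \ge 1$. Therefore, there exist a subsequence $\{\alpha_k\}$
of $\{\alpha_n\}$ and a constant $g^*$ that satisfy

(4.5)    $\lim_{k \to \infty} \alpha_k J^*_{\alpha_k}(x_0) = g^*, \qquad u^*_1(x) := \liminf_{k \to \infty} h_{\alpha_k}(x).$

Since $|h_\alpha(x)| \le |v_1(x)| + |v_2(x)|$ for all $x \in S$ and $\alpha > 0$ (by Assumption C),
by (4.5) we have

$u^*_1 \in B_w(S)$ and $\lim_{k \to \infty} \alpha_k h_{\alpha_k}(x) = 0 \quad \forall x \in S.$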

On the other hand, take any real-valued measurable function $m$ on $S$ such
that $m(x) > q(x) \ge 0$ for all $x \in S$. Then, for each $x \in S$ and $a \in A(x)$, by
P1-P3, we see that $P(\cdot|x,a)$ defined by

(4.6)    $P(D|x,a) := \frac{q(D|x,a)}{m(x)} + I_D(x)$ for all $D \in \mathcal{B}(S)$

is a probability measure on $\mathcal{B}(S)$. In fact, by P1 we see that $P(D|x,a)$
is completely additive in $D \in \mathcal{B}(S)$. Moreover, by P3 we
also see that $P(S|x,a) = 1$. Thus, it suffices to show that $0 \le P(D|x,a) \le 1$
for all $D \in \mathcal{B}(S)$. For $x \notin D$, by P1, P2 and P3,

$-q(\{x\}|x,a) = q(S - \{x\}|x,a)$

$= q(D|x,a) + q(S - D - \{x\}|x,a)$

$\ge q(D|x,a) \ge 0,$

which together with (2.3) and $m(x) > q(x)$ gives that $P(D|x,a) = \frac{q(D|x,a)}{m(x)} \in [0,1)$
(since $x \notin D$). When $x \in D$, it also follows from P1, P2 and P3 that

$-q(\{x\}|x,a) \ge -q(\{x\}|x,a) - q(D - \{x\}|x,a)$

$= -q(D|x,a) = q(S - D|x,a) \ge 0,$

which together with (2.3) and $m(x) > q(x)$ yields that $-1 < \frac{q(D|x,a)}{m(x)} \le 0$, and
so $P(D|x,a) = \frac{q(D|x,a)}{m(x)} + 1 \in (0,1]$ (since $x \in D$).
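The construction in (4.6) is easy to check numerically when the state space is finite, where a set $D$ can be taken to be a single state. In the sketch below the rate row, the state index and the value of `m` are hypothetical; the function simply divides the rates by `m` and adds the indicator of the current state.

```python
def uniformized_kernel(row, x, m):
    """P(y|x,a) = q(y|x,a)/m + 1{y = x}, the finite-state form of (4.6)."""
    return [rate / m + (1.0 if y == x else 0.0) for y, rate in enumerate(row)]

row = [2.0, -5.0, 3.0]   # hypothetical rates out of state x = 1
m = 6.0                  # any value strictly above the exit rate 5
P = uniformized_kernel(row, x=1, m=m)
```

The diagonal entry `P[1]` is positive only because `m` exceeds the exit rate; choosing `m` below the exit rate would make it negative, which is why the proof requires $m(x) > q(x)$.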

Noting that $h_\alpha(x) = J^*_\alpha(x) - J^*_\alpha(x_0)$, by (4.6) and P3 we can rewrite (3.2)
as

(4.7)    $\frac{\alpha J^*_\alpha(x_0)}{m(x)} + \frac{\alpha h_\alpha(x)}{m(x)} + h_\alpha(x) = \sup_{a \in A(x)} \left\{ \frac{r(x,a)}{m(x)} + \int_S h_\alpha(y)\,P(dy|x,a) \right\} \quad \forall x \in S.$

Thus, for each $k \ge 1$ and $x \in S$, by (4.7) we have

(4.8)    $\frac{\alpha_k J^*_{\alpha_k}(x_0)}{m(x)} + \frac{\alpha_k h_{\alpha_k}(x)}{m(x)} + h_{\alpha_k}(x) \ge \frac{r(x,a)}{m(x)} + \int_S h_{\alpha_k}(y)\,P(dy|x,a) \quad \forall a \in A(x).$

Applying the extension of Fatou’s lemma (8.3.7 in [21]), by (4.8) and (4.5)
we get

$\frac{g^*}{m(x)} + u^*_1(x) \ge \frac{r(x,a)}{m(x)} + \int_S u^*_1(y)\,P(dy|x,a) \quad \forall x \in S$ and $a \in A(x)$.

This, together with (4.6), yields

$g^* \ge r(x,a) + \int_S u^*_1(y)\,q(dy|x,a) \quad \forall x \in S$ and $a \in A(x)$,

which gives

(4.9)    $g^* \ge \sup_{a \in A(x)} \left\{ r(x,a) + \int_S u^*_1(y)\,q(dy|x,a) \right\} \quad \forall x \in S,$

and so (4.2) follows.

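To prove (4.3), for each $x \in S$ and $k \ge 1$, let

(4.10)    $u^*_2(x) := \limsup_{k \to \infty} h_{\alpha_k}(x), \qquad g_{\alpha_k}(x) := \sup\{h_{\alpha_m}(x) : m \ge k\}.$

Then we have

(4.11)    $u^*_2 \in B_w(S), \qquad u^*_2(x) = \lim_{k \to \infty} g_{\alpha_k}(x)$ and $g_{\alpha_k}(x) \ge h_{\alpha_k}(x),$

which together with (4.7) gives

(4.12)    $\frac{\alpha_k J^*_{\alpha_k}(x_0)}{m(x)} + \frac{\alpha_k h_{\alpha_k}(x)}{m(x)} + h_{\alpha_k}(x) = \sup_{a \in A(x)} \left\{ \frac{r(x,a)}{m(x)} + \int_S h_{\alpha_k}(y)\,P(dy|x,a) \right\}$

$\le \sup_{a \in A(x)} \left\{ \frac{r(x,a)}{m(x)} + \int_S g_{\alpha_k}(y)\,P(dy|x,a) \right\}.$

Since $g_{\alpha_{k+1}} \le g_{\alpha_k}$ for all $k \ge 1$, $\lim_{k \to \infty} [\sup_{a \in A(x)} \{ \frac{r(x,a)}{m(x)} + \int_S g_{\alpha_k}(y)\,P(dy|x,a) \}]$
exists. Thus, by (4.5), (4.10) and (4.12) we have

(4.13)    $\frac{g^*}{m(x)} + u^*_2(x) \le \lim_{k \to \infty} \left[ \sup_{a \in A(x)} \left\{ \frac{r(x,a)}{m(x)} + \int_S g_{\alpha_k}(y)\,P(dy|x,a) \right\} \right] \quad \forall x \in S.$

Also, for each fixed $x \in S$ and $k \ge 1$, by Assumption B, there exists $a_k(x) \in A(x)$
such that

(4.14)    $\sup_{a \in A(x)} \left\{ \frac{r(x,a)}{m(x)} + \int_S g_{\alpha_k}(y)\,P(dy|x,a) \right\} = \frac{r(x,a_k(x))}{m(x)} + \int_S g_{\alpha_k}(y)\,P(dy|x,a_k(x)).$

Since $A(x)$ is compact, there exists a subsequence $\{a_{k_i}(x)\}$ of $\{a_k(x)\}$ such
that $\lim_{i \to \infty} a_{k_i}(x) =: a'(x) \in A(x)$. Noting that $\|g_{\alpha_k}\|_w \le \|v_1\|_w + \|v_2\|_w$ for
all $k \ge 1$, by (4.13), (4.14) and the extension of Fatou’s lemma (8.3.7 in [21])
we obtain

$\frac{g^*}{m(x)} + u^*_2(x) \le \lim_{i \to \infty} \left[ \sup_{a \in A(x)} \left\{ \frac{r(x,a)}{m(x)} + \int_S g_{\alpha_{k_i}}(y)\,P(dy|x,a) \right\} \right]$

$= \lim_{i \to \infty} \left[ \frac{r(x,a_{k_i}(x))}{m(x)} + \int_S g_{\alpha_{k_i}}(y)\,P(dy|x,a_{k_i}(x)) \right]$

$\le \frac{r(x,a'(x))}{m(x)} + \int_S u^*_2(y)\,P(dy|x,a'(x))$

$\le \sup_{a \in A(x)} \left\{ \frac{r(x,a)}{m(x)} + \int_S u^*_2(y)\,P(dy|x,a) \right\},$

which yields (4.3).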

Moreover, by Assumption B and the extension of Fatou’s lemma (8.3.7
in [21]), we see that $v(x,a) := r(x,a) + \int_S u^*_2(y)\,q(dy|x,a)$ is continuous in
$a \in A(x)$ for each $x \in S$. Then, Lemma 3.5 with $D(x) := A(x)$ together with
(4.3) gives the existence of $f^*$ ($\in F$), satisfying (4.4). Thus, the proof of (a)
is complete.

(b) For each $\pi = (\pi_t) \in \Pi$, from (4.2) we get

$g^* \ge r(x,a) + \int_S u^*_1(y)\,q(dy|x,a) \quad \forall a \in A(x)$ and $x \in S$,

which together with (2.4) and (2.5), gives

$g^* \ge r(x,\pi_t) + \int_S u^*_1(y)\,q(dy|x,\pi_t) \quad \forall t \ge 0$ and $x \in S$.

Thus, by Theorem 4.1(a) with $u := u^*_1$, we have

$g^* \ge V(x,\pi) \quad \forall x \in S$ and $\pi \in \Pi$,

and so

(4.15)    $g^* \ge \sup_{\pi \in \Pi} V(x,\pi) \quad \forall x \in S.$













CONTINUOUS-TIME MARKOV DECISION PROCESSES 




Similarly, by (4.4) and Theorem 4.1(b) with $u = u^*_2$, we have

(4.16)    $g^* \le V(x,f^*) \quad \forall x \in S.$

By (4.15) and (4.16) we have $g^* = V(x,f^*) = \sup_{\pi \in \Pi} V(x,\pi)$ for all $x \in S$,
and so (b) follows.

(c) Obviously, (c) follows from the proof of (a) and (b). □

Remark 4.1. (a) From the proof of Theorem 4.2 we see that the
approach used to prove Theorem 4.2 is different from the optimality inequality
approach (e.g., [16, 19] for continuous-time MDPs and [20, 21, 31, 34] for
discrete-time MDPs). In fact, there are two key steps in the proof of the
existence of an (average) optimal stationary policy by using the optimality
inequality approach. The first step is to obtain an inequality as in (4.15) by
the Abelian theorem (e.g., [19, 37]), relating the average criterion $V(x,\pi)$ to
the discounted criterion $J_\alpha(x,\pi)$. The other step is to get another
inequality as in (4.16) from the optimality inequality as in (4.4). However, to use
the Abelian theorem, the reward (or cost) rates have to be nonpositive (or
nonnegative). Therefore, the optimality inequality approach in the previous
literature cannot be applied to our case because in our model the reward rates
may have neither upper nor lower bounds.

(b) From the proof of Theorem 4.2(a), we also see that properties
P1-P3 about the transition rates play a particular role. In fact, without these
properties, we can neither define the probability measure $P(\cdot|x,a)$ in (4.6)
nor prove Theorem 4.2(a) by applying the extension of Fatou’s lemma (8.3.7
in [21]) to the right-hand sides of (4.8) and (4.14). On the other hand, it does
not seem to be possible to prove the existence of (average) optimal feedback
policies in controlled stochastic differential equations (SDEs) (see [2, 19],
for instance) by using the above approach because in controlled SDEs such
properties P1-P3 fail to hold.

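Theorem 4.2 ensures the existence of an (average) optimal stationary
policy. We now give an interesting semimartingale characterization of such
a policy.

For each $x \in S$, $f \in F$, $u \in B_w(S)$ and any constant $g$, let

(4.17)    $\Delta(x;f,u,g) := r(x,f(x)) + \int_S u(y)\,q(dy|x,f(x)) - g$

and define a continuous-time stochastic process

(4.18)    $M_t(f,u,g) := \int_0^t r(x(y),f(x(y)))\,dy + u(x(t)) - tg$

for each $t \ge 0$.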


Theorem 4.3. Suppose that Assumptions A, B and C hold.

(a) If $f^*$ is the optimal stationary policy obtained in Theorem 4.2, and
$u^*_1$, $u^*_2$ and $g^*$ are from Theorem 4.2, then:

(a1) For all $x \in S$, $\{M_t(f^*,u^*_2,g^*), \mathcal{F}_t\}$ is a $P^{f^*}_x$-submartingale.
(a2) For all $f \in F$ and $x \in S$, $\{M_t(f,u^*_1,g^*), \mathcal{F}_t\}$ is a $P^f_x$-supermartingale.

(b) Conversely, if there exist a policy $\hat{f} \in F$, functions $u'_1, u'_2 \in B_w(S)$
and some constant $g$ such that:

(b1) $\{M_t(\hat{f},u'_2,g), \mathcal{F}_t\}$ is a $P^{\hat{f}}_x$-submartingale for all $x \in S$ and
(b2) $\{M_t(f,u'_1,g), \mathcal{F}_t\}$ is a $P^f_x$-supermartingale for all $f \in F$ and $x \in S$,
then the stationary policy $\hat{f}$ is (average) optimal.


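Proof. For each $f \in F$, $u \in B_w(S)$, $x \in S$ and constant $g$, we have

(4.19)    $E^f_x[M_t(f,u,g) \mid \mathcal{F}_s] = M_s(f,u,g) + E^f_x\left[ \int_s^t \Delta(x(y);f,u,g)\,dy \,\Big|\, \mathcal{F}_s \right] \quad \forall t \ge s \ge 0.$

In fact, from (4.17) and (4.18), we have

(4.20)    $E^f_x\left[ \int_s^t \Delta(x(y);f,u,g)\,dy \,\Big|\, \mathcal{F}_s \right] = E^f_x\left[ \int_s^t r(x(y),f(x(y)))\,dy \,\Big|\, \mathcal{F}_s \right] + E^f_x\left[ \int_s^t H(x(y);f,u)\,dy \,\Big|\, \mathcal{F}_s \right] - (t-s)g,$

where $H(x;f,u) := \int_S u(y)\,q(dy|x,f(x))$. Using the Markov property, we
obtain

$E^f_x\left[ \int_s^t H(x(y);f,u)\,dy \,\Big|\, \mathcal{F}_s \right] = E^f_{s,x(s)}\left[ \int_s^t H(x(y);f,u)\,dy \right],$

which together with Lemma 3.4 and Fubini’s theorem gives

(4.21)    $E^f_x\left[ \int_s^t H(x(y);f,u)\,dy \,\Big|\, \mathcal{F}_s \right] = \int_s^t E^f_{s,x(s)}[H(x(y);f,u)]\,dy.$

Applying Lemma 3.4 and the Dynkin formula (e.g., page 141 or 146 in [9]),
from (4.21) we obtain

(4.22)    $E^f_x\left[ \int_s^t H(x(y);f,u)\,dy \,\Big|\, \mathcal{F}_s \right] = E^f_{s,x(s)} u(x(t)) - u(x(s)).$

Thus, substituting (4.22) into (4.20) we get

(4.23)    $E^f_x\left[ \int_s^t \Delta(x(y);f,u,g)\,dy \,\Big|\, \mathcal{F}_s \right] = E^f_x\left[ \int_s^t r(x(y),f(x(y)))\,dy \,\Big|\, \mathcal{F}_s \right] + E^f_{s,x(s)} u(x(t)) - u(x(s)) - (t-s)g.$

On the other hand, from (4.18) and the Markov property we have

(4.24)    $E^f_x[M_t(f,u,g) \mid \mathcal{F}_s] = M_s(f,u,g) + E^f_x\left[ \int_s^t r(x(y),f(x(y)))\,dy \,\Big|\, \mathcal{F}_s \right] - u(x(s)) + E^f_{s,x(s)} u(x(t)) - (t-s)g.$

Finally, use (4.24) and (4.23) to obtain (4.19).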

(a) For each $f \in F$ and $x \in S$, from (4.2) and (4.17) we have

$\Delta(x;f,u^*_1,g^*) \le 0,$

which together with (4.19) implies that $\{M_t(f,u^*_1,g^*), \mathcal{F}_t\}$ is a
$P^f_x$-supermartingale. Similarly, we see that $\{M_t(f^*,u^*_2,g^*), \mathcal{F}_t\}$ is a $P^{f^*}_x$-submartingale
and so (a) follows.

(b) For each $x \in S$, $f \in F$ and $u \in B_w(S)$, taking expectations in both sides of (4.19) gives

$$E_x^f M_t(f,u,g) = E_x^f M_s(f,u,g) + E_x^f\left[\int_s^t \Delta(x(y); f,u,g)\,dy\right] \qquad \forall t \ge s \ge 0. \tag{4.25}$$

By Lemma 3.4 and (4.17), $\Delta(\cdot\,; f,u,g)$ belongs to $B_{w+w'}(S)$. Thus, by condition (b$_1$) we have

$$E_x^f[M_t(f, u_2', g)] \ge E_x^f[M_s(f, u_2', g)] \qquad \forall t \ge s \ge 0.$$

Then, by (4.25) and Fubini's theorem, we get

$$E_x^f\left[\int_s^t \Delta(x(y); f, u_2', g)\,dy\right] = \int_s^t E_x^f[\Delta(x(y); f, u_2', g)]\,dy \ge 0 \qquad \forall t \ge s \ge 0$$

and so

$$E_x^f \Delta(x(t); f, u_2', g) \ge 0 \qquad \text{for a.e. } t \ge 0.$$

Therefore, there exists a sequence $t_n \downarrow 0$ as $n \to \infty$ such that

$$E_x^f \Delta(x(t_n); f, u_2', g) \ge 0 \qquad \forall n \ge 0 \text{ and } x \in S. \tag{4.26}$$

Since $\Delta(\cdot\,; f, u_2', g) \in B_{w+w'}(S)$, letting $n \to \infty$ in (4.26), from Lemma 3.4(a) we get

$$\Delta(x; f, u_2', g) \ge 0$$

and so

$$g \le r(x, f(x)) + \int_S u_2'(y)\,q(dy|x, f(x)) \qquad \forall x \in S. \tag{4.27}$$

Then, by (4.27) and Theorem 4.1(b), we get

$$g \le V(x, f) \qquad \forall x \in S. \tag{4.28}$$

Similarly, as in the proof of (4.27), by condition (b$_2$) we have

$$g \ge r(x, f'(x)) + \int_S u_1'(y)\,q(dy|x, f'(x)) \qquad \forall x \in S \text{ and } f' \in F$$

and so

$$g \ge r(x, a) + \int_S u_1'(y)\,q(dy|x, a) \qquad \forall x \in S \text{ and } a \in A(x).$$

Then, by (2.3), (2.5) and Theorem 4.1(a) we have

$$g \ge \sup_{\pi \in \Pi} V(x, \pi) \qquad \forall x \in S. \tag{4.29}$$

Combining (4.28) with (4.29) gives

$$V(x, f) = \sup_{\pi \in \Pi} V(x, \pi)$$

and so (b) follows. □


Theorem 4.3 gives a semimartingale characterization of an optimal stationary policy.
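The mechanism behind (4.19) can be seen in the simplest possible setting. The sketch below is only an illustration on a hypothetical finite birth-death chain with one fixed policy, not part of the paper's argument. It takes $\Delta(x; f,u,g) = r(x,f(x)) + \sum_y q(y|x,f(x))u(y) - g$, as (4.20) suggests, solves the equation $\Delta \equiv 0$ for $(g,u)$, and checks the residual. When $\Delta$ vanishes identically, (4.19) says that $M_t(f,u,g)$ is simultaneously a sub- and a supermartingale. All names and numerical values are assumptions made for the example.

```python
def generator(n_max, birth, death):
    """Rows q[x][y] of a conservative birth-death generator on {0, ..., n_max}."""
    q = [[0.0] * (n_max + 1) for _ in range(n_max + 1)]
    for x in range(n_max + 1):
        if x < n_max:
            q[x][x + 1] = birth
        if x > 0:
            q[x][x - 1] = death[x]
        q[x][x] = -sum(q[x])  # diagonal entry makes the row sum to zero
    return q


def stationary_distribution(n_max, birth, death):
    """Invariant probability vector obtained from detailed balance."""
    weights = [1.0]
    for x in range(1, n_max + 1):
        weights.append(weights[-1] * birth / death[x])
    total = sum(weights)
    return [v / total for v in weights]


def solve_poisson(n_max, birth, death, reward):
    """Return (g, u) with r(x) + sum_y q(y|x) u(y) - g = 0 and u(0) = 0."""
    pi = stationary_distribution(n_max, birth, death)
    g = sum(p * r for p, r in zip(pi, reward))
    u = [0.0]
    prev_step = 0.0  # u(x) - u(x-1); not used at x = 0
    for x in range(n_max):
        down = death[x] if x > 0 else 0.0
        step = (g - reward[x] + down * prev_step) / birth
        u.append(u[-1] + step)
        prev_step = step
    return g, u


def delta(q, reward, u, g):
    """Delta(x; f, u, g) = r(x) + sum_y q(y|x) u(y) - g for every state x."""
    n = len(u)
    return [reward[x] + sum(q[x][y] * u[y] for y in range(n)) - g for x in range(n)]


birth = 1.0
death = [0.0, 2.0, 4.0, 6.0, 8.0]           # hypothetical policy: death[x] = 2 * x
reward = [3.0 * x - 2.0 for x in range(5)]  # hypothetical reward rates
g, u = solve_poisson(4, birth, death, reward)
residual = delta(generator(4, birth, death), reward, u, g)
```

Here `residual` is zero up to rounding in every state, including the last one, where the equation is not imposed by the recursion but follows from the choice of `g` as the stationary average of the rewards.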


5. Examples. In this section we will use five examples to illustrate our 
conditions and results. 


Example 5.1 (Optimal control of generalized Potlach processes in [4, 
22]). The generalized Potlach process [4, 22] is a Q-process generated by 
the infinitesimal operator L defined by (5.1) below. Here we are interested 
in the following optimal control problem. 

Take $S := [1,\infty)^d$ with an integer $d \ge 1$. Then the generalized Potlach process can be generated by the operator $L$ defined by

$$Lu(x, a_1) := \sum_{i=1}^{d} \int_0^\infty \left[u\left(x - x_i e_i + y \sum_{j=1}^{d} p_{ij} x_i e_j\right) - u(x)\right] dF_\lambda(y) \qquad \text{for } x \in S, \tag{5.1}$$

where $a_1 := (p_{ij})$ is a Markov transition matrix on $\{1,2,\ldots,d\}$, $e_i$ is the $i$th unit vector in $R^d$ and $F_\lambda(y)$ is a real-valued distribution function with a parameter $\lambda$, which can be regarded as a fixed reward fee. When the process is at state $x = (x_1,\ldots,x_d) \in S$, the cost incurred at each component $x_i$ is presented by $q_i \in [0, q_i^*]$, where $q_i^* > 0$ for all $i = 1,\ldots,d$. Let $a_2 := (q_1,\ldots,q_d)$. Here we interpret the parameters $a_1$ and $a_2$ as an action $a := (a_1, a_2)$, which belongs to a set $A_1 \times A_2$ of available actions. Suppose that $A_1$ is a finite set of Markov transition matrices $(p_{ij})$, $A_2 := [0,q_1^*] \times \cdots \times [0,q_d^*]$, $F_\lambda(y) := (1 - e^{-\lambda y}) I_{[0,\infty)}(y)$ with $\lambda > 1$ and, for each $f \in F$, $x^{(1)}, x^{(2)} \in S$ such that $x^{(1)} \le x^{(2)}$ with the semiorder, $x_i^{(1)} \sum_{j=1}^{d} p_{ij} e_j \le x_i^{(2)} \sum_{j=1}^{d} p'_{ij} e_j$ for all $(p_{ij}), (p'_{ij}) \in A_1$ and $i = 1,\ldots,d$. For each $D \in B(S)$, $x \in S$ and $a_1 = (p_{ij}) \in A_1$, let

$$q(D|x, a_1) := \sum_{i=1}^{d} \int_0^\infty I_{D \setminus \{x\}}\left(x - x_i e_i + y \sum_{j=1}^{d} p_{ij} x_i e_j\right) \lambda e^{-\lambda y}\,dy. \tag{5.2}$$

Then, for each $x \in S$ and $a = (a_1, a_2) \in A := A_1 \times A_2$ with $a_1 := (p_{ij})$ and $a_2 := (q_1,\ldots,q_d)$, the transition rates $q(D|x, a)$ and the reward rates $r(x, a)$, which may depend on given parameter $\lambda$, are defined by

$$q(D|x, a) := q(D|x, a_1) - I_D(x)\,q(S|x, a_1) \tag{5.3}$$

and

$$r(x, a) := \sum_{i=1}^{d} \sum_{j=1}^{d} q_i p_{ij} x_j - \lambda(x_1 + \cdots + x_d), \tag{5.4}$$

respectively.

For each $x = (x_1,\ldots,x_d) \in S$, let $w(x) := x_1 + x_2 + \cdots + x_d$. Then, by (5.3) we have

$$q(x) := \sup_{a \in A(x)} [-q(\{x\}|x, a)] \le d. \tag{5.5}$$

Moreover, by (5.2) and (5.3) we have

$$\int_S w(y)\,q(dy|x, a) \le (x_1 + \cdots + x_d) \int_0^\infty \lambda(y-1) e^{-\lambda y}\,dy = -\frac{\lambda - 1}{\lambda}\,w(x), \tag{5.6}$$

which together with (5.5) verifies Assumption A with $c := \frac{\lambda - 1}{\lambda}$ and $b = 0$.

By (5.4) we have $|r(x, a)| \le d(q_1^* + \cdots + q_d^* + \lambda)w(x)$ for all $x \in S$ and $a \in A$, which together with (5.5), Remark 3.1 and the finiteness of $A_1$, implies Assumption B.
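The constant in (5.6) comes from the elementary integral $\int_0^\infty \lambda(y-1)e^{-\lambda y}\,dy = 1/\lambda - 1$. As a quick sanity check, the following sketch approximates that integral with a midpoint rule on a truncated interval; the function name, truncation point and step count are choices made for the illustration only.

```python
import math


def drift_factor(lam, upper=60.0, steps=200_000):
    """Midpoint-rule approximation of the integral of lam*(y-1)*exp(-lam*y) over [0, upper]."""
    h = upper / steps
    total = 0.0
    for k in range(steps):
        y = (k + 0.5) * h
        total += lam * (y - 1.0) * math.exp(-lam * y)
    return total * h


def closed_form(lam):
    """The value -(lam - 1) / lam appearing on the right-hand side of (5.6)."""
    return -(lam - 1.0) / lam
```

For any $\lambda > 1$ the two functions agree to several decimal places, and the common value is negative, which is what makes $w$ a drift function with $c = (\lambda-1)/\lambda > 0$.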

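Finally, we verify Assumption C. In fact, let $D$ be any monotone set in $S$ [i.e., $I_D(x)$ is increasing in $x$]. Take any $f \in F$ and $x^{(1)}, x^{(2)} \in S$ such that $x^{(1)} \le x^{(2)}$. Let $f(x^{(1)}) =: ((p_{ij}^{(1)}), a_2^{(1)})$ and $f(x^{(2)}) =: ((p_{ij}^{(2)}), a_2^{(2)})$. Then $x_i^{(1)} \sum_{j=1}^{d} p_{ij}^{(1)} e_j \le x_i^{(2)} \sum_{j=1}^{d} p_{ij}^{(2)} e_j$. Thus, for each $i \in \{1,2,\ldots,d\}$ and $y \ge 0$, we have

$$\xi(i, y, f(x^{(1)})) := (x^{(1)} - x_i^{(1)} e_i) + y x_i^{(1)} \sum_{j=1}^{d} p_{ij}^{(1)} e_j \le (x^{(2)} - x_i^{(2)} e_i) + y x_i^{(2)} \sum_{j=1}^{d} p_{ij}^{(2)} e_j =: \xi(i, y, f(x^{(2)})),$$

which together with the monotonicity of set $D$, gives

$$\xi(i, y, f(x^{(2)})) \in D \quad \text{if } \xi(i, y, f(x^{(1)})) \in D$$

and

$$\xi(i, y, f(x^{(1)})) \in D^c \quad \text{if } \xi(i, y, f(x^{(2)})) \in D^c.$$

Thus, if $x^{(1)}, x^{(2)} \notin D$, by (5.2) we have

$$q(D|x^{(1)}, f(x^{(1)})) = \sum_{i=1}^{d} \int_0^\infty I_D(\xi(i, y, f(x^{(1)})))\,\lambda e^{-\lambda y}\,dy \le \sum_{i=1}^{d} \int_0^\infty I_D(\xi(i, y, f(x^{(2)})))\,\lambda e^{-\lambda y}\,dy = q(D|x^{(2)}, f(x^{(2)})), \tag{5.7}$$

and if $x^{(1)}, x^{(2)} \in D$, we also have

$$q(D^c|x^{(1)}, f(x^{(1)})) = \sum_{i=1}^{d} \int_0^\infty I_{D^c}(\xi(i, y, f(x^{(1)})))\,\lambda e^{-\lambda y}\,dy \ge \sum_{i=1}^{d} \int_0^\infty I_{D^c}(\xi(i, y, f(x^{(2)})))\,\lambda e^{-\lambda y}\,dy = q(D^c|x^{(2)}, f(x^{(2)})),$$

and so it follows from Lemma 3.3(b) that Assumption C holds.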


By the discussions above, we see that for Example 5.1 all conditions in 
this paper are satisfied. It should be noted that in Example 5.1 the state 
space is not denumerable and the reward rates have neither upper nor lower 
bounds; see (5.4). Therefore, the earlier conditions in [5, 6, 13, 16, 17, 18, 19,
23, 24, 26, 27, 30, 31, 34, 35, 39, 41] fail to hold because, except in [5, 19], the 
state spaces in the previous literature are all denumerable, while the reward 
rates in [5] and cost rates in [19] are uniformly bounded and bounded below, 
respectively. 


Example 5.2 (Optimal control of birth-death systems in [1, 4]). Consider a controlled birth-death system in which the state variable denotes a population size at any time $t \ge 0$. The birth rate is assumed to be a fixed constant $\lambda > 0$, but the death rates $\mu$ are assumed to be controlled by a decision-maker. Here we interpret any death rate $\mu$ as an action $a$ (i.e., $\mu =: a$). When the system's state is at $x \in S := \{0,1,\ldots\}$, the decision-maker takes an action $a$ from a given set $A(x) = [\mu_1, \mu_2]$ with $\mu_2 > \mu_1 > 0$, which increases or decreases the death rates given by (5.10) and (5.11) below. This action incurs a cost at rate $rc(x,a)$. In addition, suppose that the benefit caused by each population is presented by $p > 0$ for each unit of time, and then the decision-maker gets a reward at rate $px$ for each unit of time during which the system remains in state $x$.

We now formulate this system as a continuous-time Markov decision process. The corresponding transition rates $q(y|x,a)$ are given as

$$q(1|0,a) = -q(0|0,a) := \lambda \qquad \forall a \in [\mu_1, \mu_2], \tag{5.9}$$

$$q(0|1,a) := a, \qquad q(1|1,a) = -a - \lambda, \qquad q(2|1,a) := \lambda \qquad \forall a \in [\mu_1, \mu_2]. \tag{5.10}$$

For each $x \ge 2$ and $a \in A(x) = [\mu_1, \mu_2]$,

$$q(y|x,a) := \begin{cases} p_1 a x, & \text{if } y = x - 2, \\ p_2 a x, & \text{if } y = x - 1, \\ -(a + \lambda)x, & \text{if } y = x, \\ \lambda x, & \text{if } y = x + 1, \\ 0, & \text{otherwise,} \end{cases} \tag{5.11}$$

where $p_1 \ge 0$ and $p_2 \ge 0$ are fixed constants and $p_1 + p_2 = 1$.
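The rates (5.9)-(5.11) are easy to tabulate, which also makes it simple to confirm that every row is conservative (the rates out of a state sum to zero). The following is a minimal sketch; the function name, the dictionary representation and the sample parameter values are assumptions made for the illustration.

```python
def bd_rates(x, a, lam, p1):
    """Nonzero transition rates q(y|x,a) of (5.9)-(5.11), returned as {y: rate}."""
    p2 = 1.0 - p1
    if x == 0:
        return {0: -lam, 1: lam}                  # (5.9)
    if x == 1:
        return {0: a, 1: -a - lam, 2: lam}        # (5.10)
    return {                                      # (5.11), x >= 2
        x - 2: p1 * a * x,
        x - 1: p2 * a * x,
        x: -(a + lam) * x,
        x + 1: lam * x,
    }


def row_sum(x, a, lam, p1):
    """Sum of all rates out of state x; zero for a conservative generator."""
    return sum(bd_rates(x, a, lam, p1).values())
```

For instance, `bd_rates(5, 2.0, 1.0, 0.3)` lists four nonzero entries, a down-jump of size two with rate `0.3 * 2.0 * 5`, a down-jump of size one, the diagonal term and an up-jump with rate `5.0`.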

By the model's description we see that the reward rates $r(x,a)$ are of the form

$$r(x,a) := px - rc(x,a) \qquad \text{for } (x,a) \in K := \{(x,a) : x \in S, a \in A(x)\}. \tag{5.12}$$

We aim to find conditions that ensure the existence of an (average) optimal stationary policy. To do this, we consider the following assumptions:

E$_1$: $\mu_1 - \lambda > 0$.

E$_2$: $p_1 \le \frac{\mu_1}{2\mu_2}$, with $p_1$ as in (5.11). (This condition obviously holds when $p_1 = 0$.)

E$_3$: The function $rc(x,a)$ is continuous in $a \in A(x) = [\mu_1, \mu_2]$ for each fixed $x \in S$, and $c^*(x) := \sup_{a \in A(x)} |rc(x,a)| \le M(x+1)$ for all $x \in S$ and some constant $M > 0$.


Under these conditions, we obtain the following. 

Proposition 5.1. Under E$_1$, E$_2$ and E$_3$, the above controlled birth-death system satisfies Assumptions A, B and C. Therefore (by Theorem 4.2), there exists an optimal stationary policy.
Proof. We shall first verify Assumption A. Let $c := \frac{1}{2}(\mu_1 - \lambda) > 0$ (by E$_1$) and let $w(x) := x + 1$ for all $x \in S$. Then, from (5.9) and (5.10) we have

$$\sum_{y \in S} q(y|0,a)w(y) = \lambda \le -cw(0) + \mu_1 + \lambda \qquad \forall a \in A(0), \tag{5.13}$$

$$\sum_{y \in S} q(y|1,a)w(y) = -(a - \lambda) \le -cw(1) \qquad \forall a \in A(1). \tag{5.14}$$

Moreover, for each $x \ge 2$ and $a \in [\mu_1, \mu_2]$, from (5.11) we have

$$\sum_{y \in S} q(y|x,a)w(y) = -(a + ap_1 - \lambda)x \le -\tfrac{2}{3}(a + ap_1 - \lambda)w(x) \le -cw(x). \tag{5.15}$$

By (5.13)-(5.15) we have

$$\sum_{y \in S} q(y|x,a)w(y) \le -cw(x) + (\mu_1 + \lambda)I_{\{0\}}(x) \le -cw(x) + \mu_1 + \lambda \qquad \forall a \in A(x) \text{ and } x \in S, \tag{5.16}$$

which gives Assumption A(1). On the other hand, by (5.9)-(5.11), we have $q(x) \le (\mu_2 + \lambda)(x + 1) = (\mu_2 + \lambda)w(x)$ and so Assumption A(2) follows. Thus, Assumption A is true.

To verify Assumption B, by (5.12) and E$_3$ we have $|r(x,a)| \le (p + M)w(x)$ for all $x \in S$. Thus, by (5.9)-(5.11) as well as E$_3$ we see that Assumptions B(1)-B(3) hold. To verify Assumption B(4), we let

$$w'(x) := (x+1)(x+2) \qquad \text{for each } x \in S.$$

Then, by (5.9)-(5.11) we have

$$q(x)w(x) \le (\mu_2 + \lambda)w'(x) \qquad \forall x \in S,$$

$$\sum_{y \in S} q(y|x,a)w'(y) \le 6\lambda w'(x) \qquad \forall a \in [\mu_1, \mu_2] \text{ and } x \in S,$$

which imply Assumption B(4) with $M' := (\mu_2 + \lambda)$, $c' := 6\lambda$, $b' := 0$.

Finally, we verify Assumption C. Since $0 \le p_1 \le \frac{\mu_1}{2\mu_2}$, by (5.9)-(5.11) we have that, for each fixed $f \in F$,

$$\sum_{y \ge k} q(y|x, f(x)) \le \sum_{y \ge k} q(y|x+1, f(x+1)) \qquad \forall x, k \in S \text{ such that } k \ne x + 1,$$

which together with Theorem 3.4 in [1], implies that the corresponding Markov process $x(t)$ is stochastically ordered. Thus, Assumption C follows from (5.16) and Lemma 3.3(b). □
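The inequalities used in this proof can be spot-checked numerically. The sketch below is an illustration, not a proof. It takes $w(x) = x+1$, $w'(x) = (x+1)(x+2)$, and, as assumptions of the sketch, the constant $c = (\mu_1 - \lambda)/2$ and the bound $p_1 \le \mu_1/(2\mu_2)$. It then tests the drift bound (5.16), the bound with $6\lambda w'(x)$, and the tail-sum comparison for $k \ne x+1$ over a finite range of states and three sample actions. All function names and parameter values are hypothetical.

```python
def bd_rates(x, a, lam, p1):
    """Nonzero transition rates q(y|x,a) of (5.9)-(5.11), returned as {y: rate}."""
    if x == 0:
        return {0: -lam, 1: lam}
    if x == 1:
        return {0: a, 1: -a - lam, 2: lam}
    return {x - 2: p1 * a * x, x - 1: (1.0 - p1) * a * x,
            x: -(a + lam) * x, x + 1: lam * x}


def drift(x, a, lam, p1, weight):
    """sum_y q(y|x,a) * weight(y)."""
    return sum(rate * weight(y) for y, rate in bd_rates(x, a, lam, p1).items())


def tail(x, k, a, lam, p1):
    """sum over y >= k of q(y|x,a)."""
    return sum(rate for y, rate in bd_rates(x, a, lam, p1).items() if y >= k)


def check_assumptions(lam, mu1, mu2, p1, max_state=200, tol=1e-9):
    """True if the three families of inequalities hold on the tested grid."""
    c = 0.5 * (mu1 - lam)                       # assumed value of c
    w = lambda x: x + 1.0
    w_prime = lambda x: (x + 1.0) * (x + 2.0)
    actions = [mu1, 0.5 * (mu1 + mu2), mu2]
    for x in range(max_state):
        for a in actions:
            bound = -c * w(x) + (mu1 + lam) * (1.0 if x == 0 else 0.0)
            if drift(x, a, lam, p1, w) > bound + tol:
                return False                    # drift bound (5.16) fails
            if drift(x, a, lam, p1, w_prime) > 6.0 * lam * w_prime(x) + tol:
                return False                    # bound for w' fails
            for a_next in actions:
                for k in range(x + 4):
                    if k == x + 1:
                        continue
                    if tail(x, k, a, lam, p1) > tail(x + 1, k, a_next, lam, p1) + tol:
                        return False            # tail-sum comparison fails
    return True
```

With $\lambda = 1$, $\mu_1 = 2$, $\mu_2 = 3$ the check passes for $p_1 = 0.3$ (which satisfies $p_1 \le 1/3$) and fails for $p_1 = 0.9$, where the tail-sum comparison already breaks at $x = 1$, $k = 1$.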




Example 5.3 (Optimal control of upwardly skip-free processes in [1]). The upwardly skip-free processes, also known as birth and death processes with catastrophes, belong to the category of population processes [1], Chapter 9, page 292, with the state space $S := \{0,1,2,\ldots\}$. Here we are interested in the average optimal control problem for such processes with catastrophes of two sizes, so the transition rates are of the form

$$q(y|x,a) := \begin{cases} \lambda x + a_1, & \text{if } y = x + 1, \\ -(\lambda x + \mu x + d(x,a_2) + a_1), & \text{if } y = x, \\ \mu x + d(x,a_2)\gamma_x^1, & \text{if } y = x - 1, \\ d(x,a_2)\gamma_x^2, & \text{if } y = x - 2, \\ 0, & \text{otherwise,} \end{cases} \tag{5.17}$$

where $x \in S$, $a := (a_1, a_2)$, the constants $\lambda > 0$, $\mu > 0$, immigration rates $a_1 \ge 0$; $d(x,a_2)$ are nonnegative numbers that represent the rates at which the "catastrophes" occur and which are assumed to be controlled by decisions $a_2$ in some compact set $B(x)$, when the process is in state $x \ge 1$; the numbers $\gamma_x^1$ and $\gamma_x^2$ are nonnegative and such that $\gamma_x^1 + \gamma_x^2 = 1$ for all $x \ge 1$ and $\gamma_1^2 = 0$; and $\gamma_x^k$ is the probability that the process makes a transition to $x - k$ ($k = 1, 2$), given that a catastrophe occurs when the process is in state $x \ge 2$. For state $x = 0$, it is natural to let $d(0,a_2) = 0$ and $\gamma_0^1 = \gamma_0^2 = 0$. On the other hand, we suppose that the immigration rates $a_1$ can also be controlled and so we interpret $a := (a_1, a_2)$ as an action. Thus, we may let the admissible action sets $A(0) := [0,b]$ and $A(x) := [0,b] \times B(x)$ for $x \ge 1$, with some constant $b > 0$. In addition, suppose that the damage caused by a catastrophe is represented by $p > 0$ for each unit of time and that it incurs a cost at rate $c(x,a_2)$ to take decision $a_2 \in B(x)$ at state $x \ge 1$. Let $c(0,\cdot) := 0$. Also, we assume that the benefits obtained by the transitions to $x - 1$ and $x - 2$ from $x$ ($\ge 2$) are represented by positive constants $q_1$ and $q_2$, respectively, and the benefit caused by each $a_1 \in [0,b]$ is represented by a real number $\tilde r(a_1)$. Then the reward rates are of the form

$$r(x,a) := \tilde r(a_1) - c(x,a_2) - p\,d(x,a_2) + q_1\gamma_x^1 d(x,a_2) + q_2\gamma_x^2 d(x,a_2)$$

for all $a = (a_1, a_2) \in A(x)$. As in the verifications of Assumptions A, B and C in Example 5.2, under the following conditions F$_1$-F$_3$, the above controlled upwardly skip-free processes satisfy Assumptions A, B and C, and, therefore (by Theorem 4.2), there exists an optimal stationary policy:

F$_1$: For all $x \ge 1$, $\mu - \lambda > 0$; 7^+1 <

F$_2$: $b \le \lambda - \mu + \inf_{\{x \ge 1,\ a_2 \in B(x)\}}\{d(x,a_2) + \gamma_x^2 d(x,a_2)\}$.

F$_3$: For each $x \in S$, the functions $\tilde r(a_1)$ and $c(x,a_2)$ are continuous in $(a_1, a_2) \in A(x)$, and $\sup_{a_2 \in B(x)} |d(x,a_2)| \le L_1(x+1)$, $\sup_{a_2 \in B(x)} |c(x,a_2)| \le L_2(x+1)$ for some constants $L_1 > 0$ and $L_2 > 0$.
In particular, all of F$_1$-F$_3$ hold when $\lambda < \mu < b + \lambda$, $\tilde r(a_1) := ra_1$, $d(x,a_2) := 2a_2x$, 7 ^ < ^ ^ and $B(x) := [b, \beta]$ for all $x \ge 1$, with some constants $r > 0$ and $\beta > b$.
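For the drift function $w(x) = x + 1$, the rates of this example give $\sum_y q(y|x,a)w(y) = (\lambda - \mu)x + a_1 - (1 + \gamma_x^2)\,d(x,a_2)$, which is where a condition of the type F$_2$ enters. The sketch below tabulates the rates and checks this identity numerically. It assumes the reading of (5.17) in which a catastrophe in state $x$ moves the process to $x-1$ with probability $\gamma_x^1$ and to $x-2$ with probability $\gamma_x^2$, with $\gamma_1^2 = 0$. The function names, the constant value used for $\gamma_x^2$ and the numerical parameters are hypothetical.

```python
def skip_free_rates(x, a1, a2, lam, mu, d, gamma2):
    """Nonzero rates q(y|x,a) under the assumed reading of (5.17), as {y: rate}."""
    dx = d(x, a2) if x >= 1 else 0.0       # no catastrophes in state 0
    g2 = gamma2(x) if x >= 2 else 0.0      # size-2 jumps need x >= 2
    g1 = 1.0 - g2
    rates = {x + 1: lam * x + a1,
             x: -(lam * x + mu * x + dx + a1)}
    if x >= 1:
        rates[x - 1] = mu * x + dx * g1
    if x >= 2:
        rates[x - 2] = dx * g2
    return rates


def drift(x, a1, a2, lam, mu, d, gamma2):
    """sum_y q(y|x,a) * (y + 1)."""
    rates = skip_free_rates(x, a1, a2, lam, mu, d, gamma2)
    return sum(rate * (y + 1.0) for y, rate in rates.items())


def drift_closed_form(x, a1, a2, lam, mu, d, gamma2):
    """(lam - mu) * x + a1 - (1 + gamma_x^2) * d(x, a2)."""
    dx = d(x, a2) if x >= 1 else 0.0
    g2 = gamma2(x) if x >= 2 else 0.0
    return (lam - mu) * x + a1 - (1.0 + g2) * dx


lam, mu, b = 1.0, 1.5, 1.0                 # satisfies lam < mu < b + lam
d = lambda x, a2: 2.0 * a2 * x             # d(x, a2) := 2 * a2 * x
gamma2 = lambda x: 0.25                    # hypothetical constant gamma_x^2
```

With these sample values and $a_1 \in [0,b]$, $a_2 \ge b$, the drift in every state $x \ge 1$ is at most $-(\mu - \lambda)(x+1)$.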

Example 5.4 (Optimal control of a pair of M/M/1 queues in tandem in [28]). Suppose that customers arrive as a Poisson stream with unit rate to the first queue, where they are serviced with mean service time $a_1^{-1}$. After service is completed at the first queue, each customer immediately departs and joins the second queue, where the mean service time is $a_2^{-1}$. After service is completed at the second queue, the customers leave the system with state space $S := \{0,1,2,\ldots\}^2$. Here, we interpret any given pair of mean service times $(a_1, a_2) =: a$ as an action and let corresponding action
sets A(xi,X 2 ) = ^ positive constants > /ri,/r 2 > /^ 2 - 

As in [28], let 

W{Xi,X2} ■—+(J2 +1^1 ^2 5 

where $\sigma_1 = 1.06$, $\sigma_2 = 1.03$, $\gamma = 0.4$, $\beta_1 = 1.5$ and $\beta_2 = 0.3$. Suppose that $\mu_1 > 3$ and $\mu_2 > 2$. Then, when $r(x_1, x_2, a)$ is bounded in all $(x_1, x_2, a)$ and continuous in $a \in A(x_1, x_2)$ for each fixed $(x_1, x_2) \in S$, from the argument in [28] and Lemma 3.3(b), we see that Assumptions A, B and C are all satisfied. In fact, under these parameter values, using the argument in [28] and a straightforward calculation, we can verify Assumption B and also Assumption A as well as the conditions (b) in Lemma 3.3 with $c := 0.002$ and $b := 0$. Therefore, there exists an average optimal stationary policy for this example with the above parameter values.


Example 5.5 (Optimal control of M/M/N/0 queue systems in [25, 40]). Here the state space is $S := \{0,1,2,\ldots,N\}$ with some integer $N \ge 1$. Suppose that the arrival rate $\lambda$ is fixed but the service rates $\mu$ can be controlled. Therefore, we interpret service rates $\mu$ as actions, which may depend on the current states $x \in S$. We denote by $A(x)$ the action sets at state $x \in S$. Since there is no service in the queue at state 0, we may suppose that $A(0) := \{0\}$ for simplicity. Also, for each $x \ge 1$ we let $A(x) := [\mu_1, \mu_2]$ with constants $\mu_2 > \mu_1 > 0$. Then, the transition rates are given as $q(0|0,0) = -\lambda = -q(1|0,0)$ and $q(N|N,\mu) = -N\mu = -q(N-1|N,\mu)$ for all $\mu \in A(N)$. Moreover, for each $1 \le x \le N - 1$ and $\mu \in A(x)$,

$$q(y|x,\mu) := \begin{cases} \lambda, & \text{if } y = x + 1, \\ -(\lambda + \mu x), & \text{if } y = x, \\ \mu x, & \text{if } y = x - 1, \\ 0, & \text{otherwise.} \end{cases}$$

Thus, when $\mu_1 > \lambda$ and a reward rate function $r(x,\mu)$ is continuous in $\mu \in A(x)$ for all $x \in S$, as in the verification of Example 5.2, we see that this controlled M/M/N/0 queue system satisfies Assumptions A, B and C.
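Because the state space here is finite and every stationary policy yields an irreducible birth-death chain, the expected average reward of a stationary policy $f$ equals $\sum_x \pi_f(x)\,r(x,f(x))$, where $\pi_f$ is the invariant distribution. The sketch below computes this quantity by detailed balance and then searches a coarse grid of service rates by brute force. The grid search is only an illustration and not the method of this paper: the optimum over the continuous set $[\mu_1, \mu_2]$ may lie off the grid. The reward function and all parameter values are hypothetical.

```python
from itertools import product


def stationary_distribution(n_servers, lam, mu):
    """Invariant distribution; mu[x] is the service rate used in state x (mu[0] unused)."""
    weights = [1.0]
    for x in range(1, n_servers + 1):
        # detailed balance: pi(x-1) * lam = pi(x) * mu[x] * x
        weights.append(weights[-1] * lam / (mu[x] * x))
    total = sum(weights)
    return [v / total for v in weights]


def average_reward(n_servers, lam, mu, reward):
    """Long-run expected average reward of the stationary policy x -> mu[x]."""
    pi = stationary_distribution(n_servers, lam, mu)
    return sum(pi[x] * reward(x, mu[x]) for x in range(n_servers + 1))


def best_grid_policy(n_servers, lam, grid, reward):
    """Exhaustive search over policies whose actions lie on a finite grid."""
    best = None
    for choice in product(grid, repeat=n_servers):
        mu = (0.0,) + choice               # A(0) = {0}
        g = average_reward(n_servers, lam, mu, reward)
        if best is None or g > best[0]:
            best = (g, mu)
    return best


def sample_reward(x, mu):
    """Hypothetical reward rate: income per busy server minus a quadratic service cost."""
    return 5.0 * x - mu ** 2


N, LAM = 3, 1.0
GRID = [2.0, 2.5, 3.0]                     # grid inside [mu_1, mu_2] with mu_1 > lam
best_value, best_policy = best_grid_policy(N, LAM, GRID, sample_reward)
```

The exhaustive search visits `len(GRID) ** N` policies, so it is only practical for small $N$ and coarse grids.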
Remark 5.1. In the verifications of Assumptions A, B and C for the five examples, a key step is the verification of Assumption C by using Lemma 3.3(b). This is due to the advantage that the drift and monotonicity conditions of Lemma 3.3(b) are imposed on the primitive data of the model. Here, we should note that these conditions have to be uniform with respect to the actions. In fact, such uniformity is used to show that the exponential convergence rate $\rho$ and the constant $R$ in (3.3) are independent of all stationary policies. On the other hand, other examples and approaches that yield exponential ergodicity (3.3) can be seen in [3, 7, 28, 36, 40], for instance. Finally, be warned that all of the underlying processes in this paper are continuous-time jump Markov processes, which can be determined by given transition rates (2.4) with the properties P$_1$-P$_3$.

6. Concluding remarks. In the previous sections we have studied the average optimality problem for continuous-time Markov decision processes in Polish spaces. Under suitable assumptions we have shown the existence of an optimal stationary policy. The approach developed to prove the existence of optimal stationary policies is different from the optimality inequality approach widely used in the previous literature. In addition, we have presented a semimartingale characterization for an optimal stationary policy. On the other hand, we believe that our formulation and approach are sufficiently general and, thus, provide a way to analyze other important problems, such as the problems of bias optimality, Blackwell optimality and stochastic games with average payoffs, which as far as we can tell have not been previously studied for continuous-time jump Markov processes with Polish spaces and unbounded transition rates. Research on these topics is in progress.

To conclude, it is worth noting that under our present conditions we 
cannot establish the average optimality equation by using the usual diagonal 
argument, because the state space may not be denumerable. We will give 
additional conditions under which the average optimality equation also holds 
in an upcoming paper. 

Acknowledgments. We are very grateful to an Associate Editor and the anonymous referees for many fine comments and suggestions that have improved this paper.